library(tidyverse)
library(knitr)
library(kableExtra) #kable_styling()
library(DT) #datatable(), formatRound()
library(readxl) #read_excel()
library(lubridate) #year()
library(forecast) #ggAcf()
library(imputeTS) #statsNA(), all imputation functions, ggplot_na_imputations()
library(plotly) #ggplotly()
A time series is a sequence of data observed at successive points in time, usually assumed to be at equally spaced intervals. Time series analysis is applicable and important in many domains.
Example applications include economics (stock prices), meteorology (temperature), energy (electricity demand), and retail (sales).
Trend: a long-term upward or downward movement of the data.
Seasonality: a recurring, periodic pattern that completes itself within a fixed time span. The amount of time it takes to get from one peak to the next is known as the period.
Autocorrelation: when a variable is highly dependent on its past values; that is, the correlation of a variable with itself across time.
This is what makes time series data so unique. You cannot perform simple linear regression on time series data because autocorrelation violates the assumption of independent and identically distributed observations that linear regression requires. With time series data, the past informs the future. You can inspect autocorrelation graphically using the ggAcf() function from the forecast package.
ggAcf(beersales) #input must be a time series object
Any lag whose autocorrelation extends beyond the dashed blue lines is statistically significant and should be considered for the autoregressive component of the model.
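ggAcf() plots the same quantities that base R's acf() computes. As a small numerical sketch of strong autocorrelation (on a simulated AR(1) series, not the beer sales data):

```r
set.seed(1)
# simulate a strongly autocorrelated AR(1) series
x <- arima.sim(model = list(ar = 0.8), n = 200)
# acf()$acf stores lag 0 at index 1, so index 2 is the lag-1 autocorrelation
r1 <- acf(x, plot = FALSE)$acf[2]
r1  # close to the true AR coefficient of 0.8
```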
As you will see in the final form of a time series regression model, autocorrelation is not modeled directly to the observations, but to the residuals of the trend and seasonality model.
The error, or residual, is the natural variability that remains once trend, seasonality, and autocorrelation are accounted for in the regression model.
A regression model for time series data takes the following form:
\[Y_t = \beta_0 + \beta_1t + \sum_{i=1}^{L-1}\beta_{i+1}X_i + \epsilon_t\] where \[\epsilon_t = \sum_{j=1}^k\phi_j\epsilon_{t-j}+w_t\]
where the \(\beta_1\) term models trend, the summation term models the \(L\) number of seasons for seasonality, the \(\epsilon\) term models the autocorrelation, and \(w_t\) is the remaining error (residuals) in the data.
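As a toy illustration of this two-stage structure (simulated data, not a recipe for any particular model), you can fit the trend and seasonal dummies with lm() and then fit an autoregressive model to the residuals:

```r
set.seed(42)
n <- 120
t <- 1:n
season <- factor(rep(1:12, length.out = n))        # L = 12 seasons as dummy variables
# simulated series: linear trend + seasonal means + AR(1) noise
y <- 0.05 * t + as.numeric(season) + arima.sim(list(ar = 0.6), n)
fit <- lm(y ~ t + season)                          # trend + seasonality
ar_fit <- arima(resid(fit), order = c(1, 0, 0))    # AR(1) on the residuals
```

The AR coefficient recovered from the residuals should land near the simulated value of 0.6, which is the autocorrelation the trend-and-seasonality model alone cannot explain.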
How to actually model and analyze a time series regression model is beyond the scope of this tutorial. We will only address how to fill in missing data values for time series data. To get started with this tutorial, all you need is a time series data set observed at equally spaced time steps.
I came across time series data while working on my fourth-year capstone project this year. My team developed a forecasting model to predict energy consumption for the University of Virginia’s Fontaine Research Park. As with most time series data, there were holes in the data. Missing values generally come from faulty sensor readings, unexpected communication errors, power outages, market closures, and so on, and they are particularly troublesome when it comes to time series modeling and analysis.
This tutorial will walk through different methods for addressing missing values in time series data. As the range of applications for time series analysis suggests, anyone working with this type of data can benefit from it.
My capstone team was given proxy building data on past electricity, heating, and cooling usage, along with temperature in Fahrenheit and relative humidity. We also extracted dates from the academic calendar to determine whether classes were in session or on break. Due to irregular building usage during the COVID-19 pandemic, we removed data after March 2020. For simplicity, this tutorial will only look at the electricity usage data in kilowatts for Minor Hall at UVA. The final data set was 15-minute-interval data from February 1, 2019 to February 1, 2020, which comprises 35,040 observations and 9 variables.
Once you have a compiled data set, you want to explore your data and understand what you are working with. The first step is to make a simple plot of your time series data.
p <- ggplot(data) + geom_line(aes(x=Timestamp, y=electricity)) + ylab('electricity in kW')
ggplotly(p)
Our univariate time series does not appear to have much of a trend, but does indicate strong seasonality. Understanding the trend, seasonality, and autocorrelation of your univariate time series will help you narrow down which imputation algorithm is the best to use.
We can see gaps in our graph around the first week of July and of November. We can also see that electricity usage was recorded as 0 a few times. Since an academic building at a university is never fully shut down and always draws some minimum amount of electricity, we assumed these were errors and replaced all 0s with NA.
data$electricity[data$electricity == 0] <- NA #replace 0 with NA
Since we are working with time series, let’s convert the three series in our data set to time series objects using ts() from the base stats package.
elec.ts <- ts(data$electricity)
temp.ts <- ts(data$tempF)
rh.ts <- ts(data$rh)
Next, you want to see exactly how many missing data values there are, if any. If you are lucky, there won’t be any! But if that were the case, then you probably wouldn’t be reading this tutorial, so let’s continue exploring…
## electricity data
length(which(is.na(elec.ts))) #552 rows with NA (out of 35,040 rows)
## [1] 552
statsNA(elec.ts) #we can see that there are two large chunks of missing electricity data
## [1] "Length of time series:"
## [1] 35040
## [1] "-------------------------"
## [1] "Number of Missing Values:"
## [1] 552
## [1] "-------------------------"
## [1] "Percentage of Missing Values:"
## [1] "1.58%"
## [1] "-------------------------"
## [1] "Number of Gaps:"
## [1] 22
## [1] "-------------------------"
## [1] "Average Gap Size:"
## [1] 25.09091
## [1] "-------------------------"
## [1] "Stats for Bins"
## [1] " Bin 1 (8760 values from 1 to 8760) : 10 NAs (0.114%)"
## [1] " Bin 2 (8760 values from 8761 to 17520) : 256 NAs (2.92%)"
## [1] " Bin 3 (8760 values from 17521 to 26280) : 5 NAs (0.0571%)"
## [1] " Bin 4 (8760 values from 26281 to 35040) : 281 NAs (3.21%)"
## [1] "-------------------------"
## [1] "Longest NA gap (series of consecutive NAs)"
## [1] "263 in a row"
## [1] "-------------------------"
## [1] "Most frequent gap size (series of consecutive NA series)"
## [1] "1 NA in a row (occuring 13 times)"
## [1] "-------------------------"
## [1] "Gap size accounting for most NAs"
## [1] "263 NA in a row (occuring 1 times, making up for overall 263 NAs)"
## [1] "-------------------------"
## [1] "Overview NA series"
## [1] " 1 NA in a row: 13 times"
## [1] " 2 NA in a row: 3 times"
## [1] " 3 NA in a row: 1 times"
## [1] " 4 NA in a row: 1 times"
## [1] " 5 NA in a row: 2 times"
## [1] " 253 NA in a row: 1 times"
## [1] " 263 NA in a row: 1 times"
## temperature data
length(which(is.na(temp.ts))) #1059 rows with NA (out of 35,040 rows)
## [1] 1059
## relative humidity data
length(which(is.na(rh.ts))) #773 rows with NA (out of 35,040 rows)
## [1] 773
## out of the 552 rows of missing electricity data, 543 also have missing temperature data
length(which(is.na(elec.ts) & is.na(temp.ts)))
## [1] 543
## out of the 552 rows of missing electricity data, 268 also have missing temperature and relative humidity data
length(which(is.na(elec.ts) & is.na(temp.ts) & is.na(rh.ts)))
## [1] 268
## extract list of indices that have missing electricity data
missing <- which(is.na(elec.ts))
# paged table of missing electricity observations
DT::datatable(data[missing,], options = list(paging=TRUE)) %>% formatRound(c('electricity','tempF','rh'),4)
Out of the 35,040 observations, 552 are missing electricity data, 1,059 are missing temperature data, and 773 are missing relative humidity data.
A univariate time series is a series of observations for a single variable at successive points in time. The singular columns for the electricity, temperature, or relative humidity are examples of univariate time series data.
| Timestamp | electricity |
|---|---|
| 2019-02-01 00:00:00 | 32.00000 |
| 2019-02-01 00:15:00 | 27.66667 |
| 2019-02-01 00:30:00 | 29.33333 |
| 2019-02-01 00:45:00 | 27.66667 |
| 2019-02-01 01:00:00 | 28.66667 |
| 2019-02-01 01:15:00 | 28.00000 |
| Timestamp | rh |
|---|---|
| 2019-02-01 00:00:00 | 47.93167 |
| 2019-02-01 00:15:00 | 49.95295 |
| 2019-02-01 00:30:00 | 53.71714 |
| 2019-02-01 00:45:00 | 55.71993 |
| 2019-02-01 01:00:00 | 57.47644 |
| 2019-02-01 01:15:00 | 57.79555 |
| Timestamp | tempF |
|---|---|
| 2019-02-01 00:00:00 | 28.15032 |
| 2019-02-01 00:15:00 | 28.15032 |
| 2019-02-01 00:30:00 | 27.47295 |
| 2019-02-01 00:45:00 | 27.13426 |
| 2019-02-01 01:00:00 | 27.13426 |
| 2019-02-01 01:15:00 | 27.13426 |
This tutorial will demonstrate the algorithms used to impute missing univariate time series data on the electricity data. The remaining variables in our data set were used later on for time series regression. The imputeTS R package is the only package that is solely dedicated to univariate time series imputation, so this tutorial will focus on the algorithms within this package.
You simply take the mean, median, or mode of the existing observations and fill in every missing value with that single value. We will demonstrate using the na_mean() function from the imputeTS package.
mean.elec.ts <- na_mean(elec.ts, option = "mean")
median.elec.ts <- na_mean(elec.ts, option = "median")
mode.elec.ts <- na_mean(elec.ts, option = "mode")
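Conceptually, mean imputation is a one-liner in base R; this toy sketch (not imputeTS internals) shows what na_mean(option = "mean") does:

```r
x <- c(2, NA, 4, NA, 6)
# replace every NA with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
# [1] 2 4 4 4 6
```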
You can also take grouped means. In this example, I grouped by month, meaning any missing value that was observed during January will be replaced with the mean of the other existing values that were observed in January, etc. The imputeTS package does not offer this functionality, so this is done using the na.aggregate() function from the zoo package.
monthly.mean.elec.ts <- zoo::na.aggregate(elec.ts, by=month(data$Timestamp))
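If you prefer to avoid the zoo dependency, the same grouped-mean idea can be done in base R with ave(); a sketch on toy data (the series, grouping, and helper name are illustrative):

```r
x <- c(10, NA, 12, 20, NA, 22)   # toy series
g <- c(1, 1, 1, 2, 2, 2)         # toy grouping variable, e.g. month
# within each group, replace NAs with that group's mean
impute_group_mean <- function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))
imputed <- ave(x, g, FUN = impute_group_mean)
imputed
# [1] 10 11 12 20 21 22
```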
This method replaces each NA with the most recent preceding non-NA value, using the na_locf() function from the imputeTS package. Note that pure LOCF cannot fill missing values at the very start of a series, since there is no prior value to carry forward; na_locf() handles such leading NAs according to its na_remaining argument.
locf.elec.ts <- na_locf(elec.ts, option = "locf")
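The carry-forward logic itself is simple; a base-R sketch (a conceptual illustration, not imputeTS internals):

```r
locf_by_hand <- function(x) {
  # carry the last observed value forward into each NA
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}
locf_by_hand(c(5, NA, NA, 7, NA))
# [1] 5 5 5 7 7
```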
Let’s see what LOCF actually looks like for the time period from 7/17 (index=15968) to 7/19 (index=16220) where electricity data was originally missing:
# tabular representation
data$locf.elec.ts <- locf.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, locf.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | locf.elec.ts |
|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 19.00000 |
| 2019-07-17 08:00:00 | NA | 19.00000 |
| 2019-07-17 08:15:00 | NA | 19.00000 |
| 2019-07-19 22:15:00 | NA | 19.00000 |
| 2019-07-19 22:30:00 | NA | 19.00000 |
| 2019-07-19 22:45:00 | NA | 19.00000 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], locf.elec.ts[15900:16400])
This method mirrors LOCF: instead of carrying the last value forward, it fills each NA backward with the next non-NA value. It is implemented by passing option='nocb' to the same na_locf() function from the imputeTS package.
nocb.elec.ts <- na_locf(elec.ts, option = "nocb")
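NOCB is the mirror image of the carry-forward sketch above; a conceptual base-R version (not imputeTS internals):

```r
nocb_by_hand <- function(x) {
  # walk backward, carrying the next observed value into each NA
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}
nocb_by_hand(c(NA, 5, NA, NA, 7))
# [1] 5 5 7 7 7
```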
# tabular representation
data$nocb.elec.ts <- nocb.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, locf.elec.ts, nocb.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | locf.elec.ts | nocb.elec.ts |
|---|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 19.00000 | 18.00000 |
| 2019-07-17 08:00:00 | NA | 19.00000 | 18.00000 |
| 2019-07-17 08:15:00 | NA | 19.00000 | 18.00000 |
| 2019-07-19 22:15:00 | NA | 19.00000 | 18.00000 |
| 2019-07-19 22:30:00 | NA | 19.00000 | 18.00000 |
| 2019-07-19 22:45:00 | NA | 19.00000 | 18.00000 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], nocb.elec.ts[15900:16400])
This method replaces missing values with a weighted moving average. The k parameter determines the width of the window. In the case of long NA gaps, the window size increases incrementally until at least \(k\) non-NA observations are found on each side of the missing value. If you decide to impute with a moving average, k can act as a tuning parameter.
The weighting parameter determines the weight of each observation in the window for calculating the mean:
Simple Moving Average: all observations in the window are equally weighted.
Linear Weighted Moving Average: the weights decrease in arithmetical progression. The observations directly next to the central value have a weight of \(\frac{1}{2}\), the observations after that have a weight of \(\frac{1}{3}\), and so forth.
Exponential Weighted Moving Average: the weights decrease exponentially. The observations directly next to the central value have a weight of \(\frac{1}{2}\), the observations after that have a weight of \(\frac{1}{4}\) (\(\frac{1}{2^2}\)), and so forth.
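The weighting schemes are easy to check by hand; a small sketch of the weight vectors and one weighted mean (illustrative values, not imputeTS internals; weighted.mean() normalizes the weights for us):

```r
k <- 3
simple_w <- rep(1, k)              # equal weights
linear_w <- 1 / (seq_len(k) + 1)   # 1/2, 1/3, 1/4
exp_w    <- 1 / 2^seq_len(k)       # 1/2, 1/4, 1/8
# exponential weighted mean over one side of a window of toy observations
weighted.mean(c(12, 19, 21), exp_w)
```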
s.ma.elec.ts <- na_ma(elec.ts, k=3, weighting='simple')
l.ma.elec.ts <- na_ma(elec.ts, k=3, weighting='linear')
e.ma.elec.ts <- na_ma(elec.ts, k=3, weighting='exponential')
# tabular representation
data$s.ma.elec.ts <- s.ma.elec.ts #add interpolated time series to data frame
data$l.ma.elec.ts <- l.ma.elec.ts #add interpolated time series to data frame
data$e.ma.elec.ts <- e.ma.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, s.ma.elec.ts, l.ma.elec.ts, e.ma.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | s.ma.elec.ts | l.ma.elec.ts | e.ma.elec.ts |
|---|---|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 13.44444 | 14.64103 | 15.66667 |
| 2019-07-17 08:00:00 | NA | 15.66667 | 16.14286 | 16.77778 |
| 2019-07-17 08:15:00 | NA | 15.66667 | 16.03704 | 16.77778 |
| 2019-07-19 22:15:00 | NA | 16.50000 | 16.66667 | 17.00000 |
| 2019-07-19 22:30:00 | NA | 16.50000 | 16.71429 | 17.00000 |
| 2019-07-19 22:45:00 | NA | 15.77778 | 16.23077 | 16.61905 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], s.ma.elec.ts[15900:16400], title="Simple Moving Average")
ggplot_na_imputations(elec.ts[15900:16400], l.ma.elec.ts[15900:16400], title = "Linear Weighted Moving Average")
ggplot_na_imputations(elec.ts[15900:16400], e.ma.elec.ts[15900:16400], title = "Exponential Weighted Moving Average")
You can choose which type of interpolation you want to perform with the option parameter:
Linear: fits a straight line between two non-NA points.
Spline: performs spline interpolation.
Stineman: performs interpolation using the Stineman algorithm.
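Linear interpolation is easy to verify by hand with base R's approx(), which fits a straight line between the surrounding non-NA points (a sketch on toy data, independent of imputeTS):

```r
x <- c(19, NA, NA, NA, 18)
idx <- seq_along(x)
# interpolate the gap linearly between the observed endpoints
y_interp <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx)$y
y_interp
# [1] 19.00 18.75 18.50 18.25 18.00
```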
linear.elec.ts <- na_interpolation(elec.ts, option = "linear")
spline.elec.ts <- na_interpolation(elec.ts, option = "spline")
stine.elec.ts <- na_interpolation(elec.ts, option = "stine")
# tabular representation
data$linear.elec.ts <- linear.elec.ts #add interpolated time series to data frame
data$spline.elec.ts <- spline.elec.ts #add interpolated time series to data frame
data$stine.elec.ts <- stine.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, locf.elec.ts, nocb.elec.ts, linear.elec.ts, spline.elec.ts, stine.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | locf.elec.ts | nocb.elec.ts | linear.elec.ts | spline.elec.ts | stine.elec.ts |
|---|---|---|---|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 19.00000 | 18.00000 | 18.99606 | 26.33327 | 18.99673 |
| 2019-07-17 08:00:00 | NA | 19.00000 | 18.00000 | 18.99213 | 33.57935 | 18.99345 |
| 2019-07-17 08:15:00 | NA | 19.00000 | 18.00000 | 18.98819 | 40.73860 | 18.99017 |
| 2019-07-19 22:15:00 | NA | 19.00000 | 18.00000 | 18.01181 | 28.94924 | 18.01612 |
| 2019-07-19 22:30:00 | NA | 19.00000 | 18.00000 | 18.00787 | 25.29936 | 18.01077 |
| 2019-07-19 22:45:00 | NA | 19.00000 | 18.00000 | 18.00394 | 21.64950 | 18.00540 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], linear.elec.ts[15900:16400], title="Linear Interpolation")
ggplot_na_imputations(elec.ts[15900:16400], spline.elec.ts[15900:16400], title="Spline Interpolation")
ggplot_na_imputations(elec.ts[15900:16400], stine.elec.ts[15900:16400], title="Stineman Interpolation")
Based on the graph, we can clearly see that spline interpolation does not fit our data. Comparing linear and Stineman interpolation, there appears to be only a slight difference.
This method applies Kalman smoothing to either a structural time series model (model="StructTS") or to the state space representation of an ARIMA model (model="auto.arima"). The smooth=TRUE option (the default) uses Kalman smoothing rather than Kalman filtering to estimate the missing values.
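To see the underlying idea, here is a rough base-R sketch using stats::StructTS() and tsSmooth() on a simulated series. This is related to, but not identical to, what na_kalman() does internally:

```r
set.seed(7)
# toy local-level series: random walk plus observation noise
x <- ts(cumsum(rnorm(100)) + rnorm(100, sd = 0.5))
x[c(30:33, 70)] <- NA                 # introduce missing values
fit <- StructTS(x, type = "level")    # structural model; fitting tolerates NAs
# fill the gaps with the Kalman-smoothed state estimates
x[is.na(x)] <- tsSmooth(fit)[is.na(x)]
```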
s.kalman.elec.ts <- na_kalman(elec.ts, model='StructTS', smooth=TRUE)
a.kalman.elec.ts <- na_kalman(elec.ts, model='auto.arima', smooth=TRUE)
# tabular representation
data$s.kalman.elec.ts <- s.kalman.elec.ts #add interpolated time series to data frame
data$a.kalman.elec.ts <- a.kalman.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, locf.elec.ts, nocb.elec.ts, linear.elec.ts, stine.elec.ts, s.kalman.elec.ts, a.kalman.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | locf.elec.ts | nocb.elec.ts | linear.elec.ts | stine.elec.ts | s.kalman.elec.ts | a.kalman.elec.ts |
|---|---|---|---|---|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 19.00000 | 18.00000 | 18.99606 | 18.99673 | 15.68387 | 15.53067 |
| 2019-07-17 08:00:00 | NA | 19.00000 | 18.00000 | 18.99213 | 18.99345 | 15.68789 | 15.37743 |
| 2019-07-17 08:15:00 | NA | 19.00000 | 18.00000 | 18.98819 | 18.99017 | 15.69192 | 15.67843 |
| 2019-07-19 22:15:00 | NA | 19.00000 | 18.00000 | 18.01181 | 18.01612 | 16.68953 | 16.76498 |
| 2019-07-19 22:30:00 | NA | 19.00000 | 18.00000 | 18.00787 | 18.01077 | 16.69355 | 16.71890 |
| 2019-07-19 22:45:00 | NA | 19.00000 | 18.00000 | 18.00394 | 18.00540 | 16.69757 | 16.82498 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], s.kalman.elec.ts[15900:16400], title="Structural Time Series Model + Kalman Smoothing")
ggplot_na_imputations(elec.ts[15900:16400], a.kalman.elec.ts[15900:16400], title="State Space Representation of ARIMA Model + Kalman Smoothing")
This method removes the seasonal component of the time series data, performs imputation on the deseasonalized series, and then adds the seasonality back again to the time series data.
You can perform any of the imputation methods above on the deseasonalized data with the algorithm parameter:
algorithm="interpolation" performs linear interpolation
algorithm="locf" performs LOCF imputation
algorithm="mean" performs mean imputation
algorithm="kalman" performs Kalman Smoothing imputation
algorithm="ma" performs exponential weighted moving average imputation
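The decompose-impute-recompose idea can be sketched in base R on a toy quarterly series (an illustration of the concept, not what na_seadec() does internally):

```r
set.seed(3)
# toy seasonal series, frequency 4, with a few NAs
x <- ts(rep(c(10, 25, 30, 15), 10) + rnorm(40), frequency = 4)
x[c(10, 21, 22)] <- NA
idx <- seq_along(x)
# 1. rough-fill the NAs (linear interpolation) so a decomposition can be estimated
rough <- x
rough[is.na(x)] <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx[is.na(x)])$y
seasonal <- decompose(rough)$seasonal
# 2. impute on the deseasonalized series
deseas <- x - seasonal
deseas[is.na(x)] <- approx(idx[!is.na(x)], deseas[!is.na(x)], xout = idx[is.na(x)])$y
# 3. add the seasonal component back to the imputed values
x[is.na(x)] <- (deseas + seasonal)[is.na(x)]
```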
inter.seadec.elec.ts <- na_seadec(elec.ts, algorithm="interpolation", find_frequency=TRUE)
ma.seadec.elec.ts <- na_seadec(elec.ts, algorithm="ma", find_frequency=TRUE)
# tabular representation
data$inter.seadec.elec.ts <- inter.seadec.elec.ts #add interpolated time series to data frame
data$ma.seadec.elec.ts <- ma.seadec.elec.ts #add interpolated time series to data frame
data %>% select(Timestamp, electricity, locf.elec.ts, nocb.elec.ts, linear.elec.ts, stine.elec.ts, inter.seadec.elec.ts, ma.seadec.elec.ts) %>% slice(15966:15970,16218:16222) %>% kable() %>% kable_styling()
| Timestamp | electricity | locf.elec.ts | nocb.elec.ts | linear.elec.ts | stine.elec.ts | inter.seadec.elec.ts | ma.seadec.elec.ts |
|---|---|---|---|---|---|---|---|
| 2019-07-17 07:15:00 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 | 12.33333 |
| 2019-07-17 07:30:00 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 | 19.00000 |
| 2019-07-17 07:45:00 | NA | 19.00000 | 18.00000 | 18.99606 | 18.99673 | 18.92412 | 15.96279 |
| 2019-07-17 08:00:00 | NA | 19.00000 | 18.00000 | 18.99213 | 18.99345 | 18.89284 | 16.27136 |
| 2019-07-17 08:15:00 | NA | 19.00000 | 18.00000 | 18.98819 | 18.99017 | 19.30643 | 17.40136 |
| 2019-07-19 22:15:00 | NA | 19.00000 | 18.00000 | 18.01181 | 18.01612 | 18.84648 | 17.89557 |
| 2019-07-19 22:30:00 | NA | 19.00000 | 18.00000 | 18.00787 | 18.01077 | 18.82775 | 17.52069 |
| 2019-07-19 22:45:00 | NA | 19.00000 | 18.00000 | 18.00394 | 18.00540 | 18.12969 | 16.69145 |
| 2019-07-19 23:00:00 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 | 18.00000 |
| 2019-07-19 23:15:00 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 | 15.00000 |
# graphical representation
ggplot_na_imputations(elec.ts[15900:16400], inter.seadec.elec.ts[15900:16400], title="Seasonally decomposed linear interpolation")
ggplot_na_imputations(elec.ts[15900:16400], ma.seadec.elec.ts[15900:16400], title="Seasonally decomposed exponential weighted moving average imputation")
Imputation always depends heavily on the characteristics of the time series you are working with: its trend, seasonality, and autocorrelation.
The data we used to demonstrate different imputation algorithms was electricity usage in kW for Minor Hall at the University of Virginia. This univariate time series data had 552 rows of missing data. A plot of our time series data revealed that there wasn’t a trend, but there was strong seasonality.
Mean, median, mode, LOCF, and NOCB imputation are all simple and fast algorithms, but they clearly fail to capture trend or seasonality. They are computationally cheap but not robust, and they are particularly problematic for large gaps, as seen in our data from 7/17 to 7/19, where they simply fit a horizontal line and captured none of the seasonality. In some cases, however, they may be enough, especially for time series with no trend or seasonality.
Moving average imputation takes mean imputation one step further by restricting the mean to observations around the missing data. This algorithm is a better choice than regular mean imputation if your time series happens to have a strong trend.
For time series with strong seasonality, the Kalman Smoothing and Seasonally Decomposed imputation algorithms work best since they directly consider the seasonality of the data.
Overall, the seasonally decomposed imputation was the algorithm that best fit our electricity time series data because it was the only one that captured seasonality.